Distributed Coordination with ZooKeeper: Leader Election and Service Discovery in Go

February 2, 2026

In previous posts, we built the core components of our distributed search system: TF-IDF ranking, HTTP networking, search workers, and a coordinator. But how do we decide which node becomes the coordinator? What happens when a node crashes?

Enter Apache ZooKeeper — a distributed coordination service that solves these problems elegantly.

The Coordination Problem

In a distributed system, we face several challenges:

  1. Leader Election: Only one node should be the coordinator at any time
  2. Failure Detection: We need to know when nodes crash
  3. Service Discovery: Workers need to find the coordinator, and vice versa
  4. Consistency: All nodes must agree on who the leader is

ZooKeeper provides primitives that make solving these problems straightforward.

ZooKeeper Fundamentals

Znodes

ZooKeeper stores data in a hierarchical namespace, similar to a filesystem. Each node is called a znode. There are two special types:

  • Ephemeral znodes: Automatically deleted when the session that created them ends (perfect for failure detection)
  • Sequential znodes: Have a monotonically increasing counter appended to their name (perfect for ordering). Both flags can be set on the same znode, which is exactly what the cluster code in this post relies on (see the sketch below).
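
Here is a minimal, self-contained sketch of that combination, assuming the go-zookeeper/zk client (the same client whose method names appear later in this post), a local server at 127.0.0.1:2181, and a made-up /demo path:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/go-zookeeper/zk"
)

func main() {
    // Connect to a local ZooKeeper server (address and timeout are illustrative).
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 10*time.Second)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Ensure the parent exists as a plain persistent znode (flags = 0).
    if _, err := conn.Create("/demo", nil, 0, zk.WorldACL(zk.PermAll)); err != nil && err != zk.ErrNodeExists {
        log.Fatal(err)
    }

    // Ephemeral + sequential: the znode disappears with our session, and the
    // server appends a monotonically increasing counter to its name.
    path, err := conn.Create(
        "/demo/node_",
        []byte("hello"),
        zk.FlagEphemeral|zk.FlagSequence,
        zk.WorldACL(zk.PermAll),
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("created:", path) // e.g. /demo/node_0000000007
}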

Our Namespace Structure

/
├── election/
│   ├── c_0000000001   (ephemeral sequential - Node A)
│   ├── c_0000000002   (ephemeral sequential - Node B)
│   └── c_0000000003   (ephemeral sequential - Node C)
├── workers_service_registry/
│   ├── n_0000000001   (ephemeral sequential - "http://worker1:8081/task")
│   └── n_0000000002   (ephemeral sequential - "http://worker2:8082/task")
└── coordinators_service_registry/
    └── n_0000000001   (ephemeral sequential - "http://coordinator:8080/search")

Leader Election Algorithm

The election algorithm is beautifully simple:

  1. Each node creates an ephemeral sequential znode under /election
  2. Get all children and sort them lexicographically
  3. If your znode is the smallest, you're the leader
  4. Otherwise, watch the znode immediately before yours
  5. When that znode is deleted, repeat from step 2

Watching only the immediate predecessor is how this recipe avoids the "herd effect": when the leader dies, only one node wakes up, not all of them.

Implementation

package cluster

// Imports used across the snippets in this post; the client library is
// assumed to be github.com/go-zookeeper/zk (or the older samuel/go-zookeeper fork).
import (
    "fmt"
    "log"
    "sort"
    "sync"

    "github.com/go-zookeeper/zk"
)

const (
    ElectionPath   = "/election"
    ElectionPrefix = "c_"
)

type LeaderElection struct {
    conn             *zk.Conn
    callback         OnElectionCallback
    currentZnodeName string
    mu               sync.Mutex
}

// OnElectionCallback defines what happens when a node's role changes
type OnElectionCallback interface {
    OnElectedToBeLeader()
    OnWorker()
}

Volunteering for Leadership

When a node starts, it creates an ephemeral sequential znode:

func (le *LeaderElection) VolunteerForLeadership() error {
    le.mu.Lock()
    defer le.mu.Unlock()

    znodePath := ElectionPath + "/" + ElectionPrefix

    createdPath, err := le.conn.CreateProtectedEphemeralSequential(
        znodePath,
        []byte{},
        zk.WorldACL(zk.PermAll),
    )
    if err != nil {
        return err
    }

    // Extract just the znode name from the full path
    le.currentZnodeName = createdPath[len(ElectionPath)+1:]
    log.Printf("Volunteered for leadership with znode: %s", le.currentZnodeName)
    return nil
}

The CreateProtectedEphemeralSequential method is crucial — it handles the case where the connection drops during creation, preventing orphaned znodes.

Running the Election

func (le *LeaderElection) reelectLeaderInternal() error {
    // Get all children of the election znode
    children, _, err := le.conn.Children(ElectionPath)
    if err != nil {
        return err
    }

    // Sort children lexicographically
    sort.Strings(children)

    // Find our position
    ourIndex := -1
    for i, child := range children {
        if child == le.currentZnodeName {
            ourIndex = i
            break
        }
    }
    if ourIndex == -1 {
        // Our znode is missing (e.g. the session expired); bail out rather than
        // index children[-2] below. The node must volunteer again first.
        return fmt.Errorf("znode %s not found under %s", le.currentZnodeName, ElectionPath)
    }

    // If we're the smallest (first), we're the leader
    if ourIndex == 0 {
        log.Printf("Elected as leader with znode: %s", le.currentZnodeName)
        le.callback.OnElectedToBeLeader()
        return nil
    }

    // Otherwise, we're a worker - watch our predecessor
    predecessorName := children[ourIndex-1]
    predecessorPath := ElectionPath + "/" + predecessorName
    log.Printf("Not leader. Watching predecessor: %s", predecessorName)
    le.callback.OnWorker()

    // Set up watch on predecessor
    go le.watchPredecessor(predecessorPath)
    return nil
}

Watching the Predecessor

func (le *LeaderElection) watchPredecessor(predecessorPath string) {
    for {
        exists, _, eventChan, err := le.conn.ExistsW(predecessorPath)
        if err != nil {
            log.Printf("Error watching predecessor: %v", err)
            return
        }

        // Handle race condition: predecessor might have disappeared already
        if !exists {
            log.Printf("Predecessor no longer exists, triggering re-election")
            le.ReelectLeader()
            return
        }

        // Wait for an event
        event := <-eventChan
        if event.Type == zk.EventNodeDeleted {
            log.Printf("Predecessor deleted, triggering re-election")
            le.ReelectLeader()
            return
        }
    }
}
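
watchPredecessor calls a public ReelectLeader method that isn't shown here. A plausible minimal version simply takes the lock and delegates to the internal method:

// ReelectLeader re-runs the election under the mutex.
// (Sketch of the wrapper referenced above; the repository's version may differ.)
func (le *LeaderElection) ReelectLeader() error {
    le.mu.Lock()
    defer le.mu.Unlock()
    return le.reelectLeaderInternal()
}

At startup a node would typically call VolunteerForLeadership followed by ReelectLeader once; after that, the watches keep the election going on their own.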

Service Registry

The service registry allows nodes to discover each other. Workers register their /task endpoint, and the coordinator registers its /search endpoint.

Implementation

type ZkServiceRegistry struct {
    conn                *zk.Conn
    registryPath        string
    currentZnode        string
    allServiceAddresses []string
    mu                  sync.RWMutex
}

const (
    WorkersRegistryPath      = "/workers_service_registry"
    CoordinatorsRegistryPath = "/coordinators_service_registry"
)
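
One thing the registration code below takes for granted is that the parent registry znode already exists; creating an ephemeral child under a missing parent fails with ErrNoNode. A constructor sketch that creates the parent as a persistent znode on first use (NewZkServiceRegistry is an illustrative name):

// NewZkServiceRegistry ensures the persistent parent znode exists, then
// returns a registry bound to it. (Sketch; only the struct above is from the post.)
func NewZkServiceRegistry(conn *zk.Conn, registryPath string) (*ZkServiceRegistry, error) {
    _, err := conn.Create(registryPath, []byte{}, 0, zk.WorldACL(zk.PermAll))
    if err != nil && err != zk.ErrNodeExists {
        return nil, err
    }
    return &ZkServiceRegistry{
        conn:         conn,
        registryPath: registryPath,
    }, nil
}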

Registering to the Cluster

func (sr *ZkServiceRegistry) RegisterToCluster(address string) error {
    sr.mu.Lock()
    defer sr.mu.Unlock()

    // Prevent duplicate registration
    if sr.currentZnode != "" {
        log.Printf("Warning: Already registered, skipping")
        return nil
    }

    // Create ephemeral sequential znode with the address as its data
    znodePath := sr.registryPath + "/n_"
    createdPath, err := sr.conn.CreateProtectedEphemeralSequential(
        znodePath,
        []byte(address),
        zk.WorldACL(zk.PermAll),
    )
    if err != nil {
        return err
    }

    sr.currentZnode = createdPath
    log.Printf("Registered at %s with address %s", createdPath, address)
    return nil
}
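
The mirror image matters too: when a worker is promoted to coordinator (Scenario 2 below), it should leave the workers registry explicitly rather than waiting for its session to die. A minimal sketch, assuming an UnregisterFromCluster method along these lines:

// UnregisterFromCluster deletes our znode so other nodes stop routing to us.
// (Sketch; version -1 tells ZooKeeper to delete regardless of the znode version.)
func (sr *ZkServiceRegistry) UnregisterFromCluster() error {
    sr.mu.Lock()
    defer sr.mu.Unlock()

    if sr.currentZnode == "" {
        return nil // never registered
    }
    exists, _, err := sr.conn.Exists(sr.currentZnode)
    if err != nil {
        return err
    }
    if exists {
        if err := sr.conn.Delete(sr.currentZnode, -1); err != nil {
            return err
        }
    }
    sr.currentZnode = ""
    return nil
}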

Discovering Services

func (sr *ZkServiceRegistry) GetAllServiceAddresses() ([]string, error) {
    children, _, err := sr.conn.Children(sr.registryPath)
    if err != nil {
        return nil, err
    }
    sort.Strings(children)

    addresses := make([]string, 0, len(children))
    for _, child := range children {
        childPath := sr.registryPath + "/" + child
        data, _, err := sr.conn.Get(childPath)
        if err != nil {
            if err == zk.ErrNoNode {
                continue // Node was deleted between the list and the read, skip it
            }
            return nil, err
        }
        if len(data) > 0 {
            addresses = append(addresses, string(data))
        }
    }
    return addresses, nil
}
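
As a usage sketch, the coordinator can resolve the current worker endpoints right before fanning out a query. listWorkers is an illustrative helper, reusing the hypothetical constructor from above:

// Illustrative only: list whatever workers are registered at this moment.
func listWorkers(conn *zk.Conn) ([]string, error) {
    workersRegistry, err := NewZkServiceRegistry(conn, WorkersRegistryPath)
    if err != nil {
        return nil, err
    }
    addresses, err := workersRegistry.GetAllServiceAddresses()
    if err != nil {
        return nil, err
    }
    for _, addr := range addresses {
        log.Printf("will send a task to %s", addr) // e.g. http://worker1:8081/task
    }
    return addresses, nil
}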

Watching for Changes

The coordinator needs to know when workers join or leave:

func (sr *ZkServiceRegistry) RegisterForUpdates() {
    go sr.watchForUpdates()
}

func (sr *ZkServiceRegistry) watchForUpdates() {
    for {
        children, _, eventChan, err := sr.conn.ChildrenW(sr.registryPath)
        if err != nil {
            log.Printf("Error watching registry: %v", err)
            return
        }

        // Update cached addresses
        sr.updateCachedAddresses(children)

        // Wait for changes
        event := <-eventChan
        if event.Type == zk.EventNodeChildrenChanged {
            log.Printf("Registry changed, refreshing addresses")
        }
    }
}
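
watchForUpdates leans on an updateCachedAddresses helper that isn't shown here; a minimal version resolves each child to its stored address and swaps the cache under the write lock:

// updateCachedAddresses refreshes the in-memory address list from ZooKeeper.
// (Sketch of the helper referenced above; the original isn't shown in this post.)
func (sr *ZkServiceRegistry) updateCachedAddresses(children []string) {
    addresses := make([]string, 0, len(children))
    for _, child := range children {
        data, _, err := sr.conn.Get(sr.registryPath + "/" + child)
        if err != nil {
            continue // the znode may have vanished between the list and the read
        }
        if len(data) > 0 {
            addresses = append(addresses, string(data))
        }
    }
    sr.mu.Lock()
    sr.allServiceAddresses = addresses
    sr.mu.Unlock()
}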

Failure Scenarios

Let's trace through what happens in various failure scenarios:

Scenario 1: Worker Crashes

  1. Worker's ZooKeeper session ends
  2. Ephemeral znode in /workers_service_registry is deleted
  3. Coordinator's watch fires
  4. Coordinator refreshes worker list
  5. Future tasks are distributed to remaining workers

Scenario 2: Coordinator Crashes

  1. Coordinator's ZooKeeper session ends
  2. Ephemeral znode in /election is deleted
  3. Next worker's watch fires (the one watching the coordinator)
  4. That worker runs re-election
  5. It finds itself as the smallest znode
  6. It becomes the new coordinator
  7. It unregisters from workers, registers as coordinator

Scenario 3: Network Partition

  1. Node loses connection to ZooKeeper
  2. Session timeout expires
  3. All ephemeral znodes for that node are deleted
  4. Other nodes detect the change via watches
  5. When the connection is restored, the node re-registers (see the sketch below)
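
Step 5 is the one part the client code has to handle itself. With the go-zookeeper/zk client, session state changes arrive on the event channel returned by zk.Connect, and the client reconnects on its own after an expiry in my experience; the sketch below assumes that behaviour and reuses RegisterToCluster from above plus the hypothetical UnregisterFromCluster sketch:

// Sketch: watch session events and re-register after the session expires.
// registry and address are assumed to come from the surrounding program.
func handleSessionEvents(events <-chan zk.Event, registry *ZkServiceRegistry, address string) {
    expired := false
    for event := range events {
        if event.Type != zk.EventSession {
            continue
        }
        switch event.State {
        case zk.StateExpired:
            // The server has already deleted all of our ephemeral znodes.
            expired = true
        case zk.StateHasSession:
            if expired {
                expired = false
                // Clear the stale registration first so RegisterToCluster's
                // duplicate-registration guard doesn't skip the new znode.
                if err := registry.UnregisterFromCluster(); err != nil {
                    log.Printf("cleanup after session expiry failed: %v", err)
                }
                if err := registry.RegisterToCluster(address); err != nil {
                    log.Printf("re-registration failed: %v", err)
                }
            }
        }
    }
}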

Testing the Election Logic

We use unit tests to verify the sorting and callback logic:

func TestSortZnodeNames(t *testing.T) {
    tests := []struct {
        name     string
        input    []string
        expected []string
    }{
        {
            name:     "reverse order",
            input:    []string{"c_0000000003", "c_0000000002", "c_0000000001"},
            expected: []string{"c_0000000001", "c_0000000002", "c_0000000003"},
        },
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            result := SortZnodeNames(tt.input)
            if !reflect.DeepEqual(result, tt.expected) {
                t.Errorf("got %v, want %v", result, tt.expected)
            }
        })
    }
}

// Mock callback for testing
type MockCallback struct {
    LeaderCalled bool
    WorkerCalled bool
}

func (m *MockCallback) OnElectedToBeLeader() {
    m.LeaderCalled = true
}

func (m *MockCallback) OnWorker() {
    m.WorkerCalled = true
}

The Complete Picture

Here's how all the pieces fit together:

┌──────────────────────────────────────────────────────────┐
│                     ZooKeeper Cluster                     │
│  ┌─────────────┐  ┌───────────────┐  ┌────────────────┐  │
│  │  /election  │  │   /workers    │  │ /coordinators  │  │
│  │   c_001     │  │    n_001      │  │    n_001       │  │
│  │   c_002     │  │    n_002      │  │                │  │
│  │   c_003     │  │               │  │                │  │
│  └─────────────┘  └───────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────┘
          │                 │                  │
    ┌─────▼──────┐   ┌──────▼───────┐   ┌──────▼───────┐
    │   Node A   │   │   Node B     │   │   Node C     │
    │  (Leader)  │   │  (Worker)    │   │  (Worker)    │
    │ Coordinator│   │ SearchWorker │   │ SearchWorker │
    └────────────┘   └──────────────┘   └──────────────┘

Key Takeaways

  1. Ephemeral znodes = automatic failure detection: When a node dies, its znodes disappear
  2. Sequential znodes = ordering: Perfect for leader election
  3. Watches = reactive updates: No polling needed
  4. Herd effect avoidance: Only watch your predecessor, not the leader
  5. Protected creates: Handle connection drops during znode creation

What's Next?

In the next post, we'll implement the OnElectionAction component that ties everything together — handling role transitions when a node becomes a leader or worker.

Get the Code

git clone git@github.com:UnplugCharger/distributed_doc_search.git
cd distributed_doc_search
git checkout 05-zookeeper-cluster
go test ./internal/cluster/... -v

This post is part of the "Distributed Document Search" series. Follow along as we build a production-ready search cluster from scratch.